Abstract. The empirically successful Thompson Sampling algorithm
for stochastic bandits has drawn much interest in understanding its
theoretical properties. One important benefit of the algorithm is that it
allows domain knowledge to be conveniently encoded as a prior
distribution to balance exploration and exploitation more effectively. While it
is generally believed that the algorithm's regret is low (high) when the
prior is good (bad), little is known about the exact dependence. This
paper is a first step towards answering this important question: focusing
on a special yet representative case, we fully characterize the algorithm's
worst-case dependence of regret on the choice of prior. As a corollary,
these results also provide useful insights into the general sensitivity of
the algorithm to the choice of priors, when no structural assumptions are
made. In particular, with $p$ being the prior probability mass of the true
reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret
upper bounds for the poor- and good-prior cases, respectively, as well
as matching lower bounds. Our proofs rely on a fundamental property
of Thompson Sampling and make heavy use of martingale theory, both
of which appear novel in the Thompson-Sampling literature and may be
useful for studying other behavior of the algorithm.


1 Introduction 

Thompson Sampling (TS), also known as probability matching and posterior
sampling, is a popular strategy for solving stochastic bandit problems. An
important benefit of this algorithm is that it allows domain knowledge to be
conveniently encoded as a prior distribution to address the
exploration-exploitation tradeoff more effectively. In this paper, we focus on
the sensitivity of the algorithm to the prior it uses. In the rest of this
section, we first define the bandit setting and notation, and describe Thompson
Sampling; we will then discuss previous works that are most related to the
present paper.

1.1 Thompson Sampling for Stochastic Bandits 

In the multi-armed bandit problem, an agent is repeatedly faced with $K$ possible
actions. At each time step $t = 1, \dots, T$, the agent chooses an action
$I_t \in \mathcal{A} := \{1, \dots, K\}$, then receives reward $X_{I_t,t}$. An
eligible action-selection strategy chooses actions at step $t$ based only on
past observed rewards $H_t = \{I_s, X_{I_s,s};\ 1 \le s < t\}$ and potentially on
an external source of randomness. More background on the bandit problem can be
found in a recent survey [5].

We make the following stochastic assumption on the underlying
reward-generating mechanism. Let $\Theta$ be a countable set of possible
reward-generating models. When $\theta \in \Theta$ is the true underlying model,
the rewards $(X_{i,t})_{t \ge 1}$ are i.i.d. random variables taking values in
$[0,1]$ drawn from some known distribution $\nu_i(\theta)$ with mean
$\mu_i(\theta)$. Of course, the agent knows neither the true underlying
model nor the optimal action that yields the highest expected reward. The
performance of the agent is measured by the regret incurred for not always
selecting the optimal action. More precisely, the frequentist regret (or regret
for short) for an eligible action-selection strategy $\pi$ under a certain
reward-generating model $\theta$ is defined as


$$R_T(\theta, \pi) := \mathbb{E} \sum_{t=1}^{T} \Big( \max_{i \in \mathcal{A}} \mu_i(\theta) - \mu_{I_t}(\theta) \Big), \qquad (1)$$

where the expectation is taken with respect to the rewards generated
according to the model $\theta$, and the potential external source of randomness.

If one imposes a prior distribution $p$ over $\Theta$, then it is natural to
consider the following notion of average regret known as Bayes regret:

$$\bar{R}_T(\pi) := \mathbb{E}_{\theta \sim p}\, R_T(\theta, \pi) = \sum_{\theta \in \Theta} R_T(\theta, \pi)\, p(\theta). \qquad (2)$$


The Thompson Sampling strategy was proposed in probably the very first
paper on multi-armed bandits [29]. This strategy takes as input a prior
distribution $p_1$ for $\theta \in \Theta$. At each time $t$, let $p_t$ be the
posterior distribution for $\theta$ given the prior $p_1$ and the history
$H_t = \{I_s, X_{I_s,s};\ 1 \le s < t\}$. Thompson Sampling selects an action
randomly according to its posterior probability of being the optimal action.
Equivalently, Thompson Sampling first draws a model $\theta_t$ from $p_t$
(independently from the past given $p_t$) and it pulls
$I_t \in \arg\max_{i \in \mathcal{A}} \mu_i(\theta_t)$.
For concreteness, we assume that the distributions
$(\nu_i(\theta))_{i \in \mathcal{A}, \theta \in \Theta}$ are absolutely
continuous with respect to some common measure $\nu$ on $[0,1]$ with likelihood
functions $(\ell_i(\theta)(\cdot))_{i \in \mathcal{A}, \theta \in \Theta}$. The
posterior distributions $p_t$ can be computed recursively by Bayes rule as
follows:

$$p_{t+1}(\theta) = \frac{p_t(\theta)\, \ell_{I_t}(\theta)(X_{I_t,t})}{\sum_{\eta \in \Theta} p_t(\eta)\, \ell_{I_t}(\eta)(X_{I_t,t})}.$$

We denote by TS(pi) the Thompson Sampling strategy with prior pi. 
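
The sampling and update steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the paper: the model set is finite, rewards are Bernoulli with means strictly inside (0, 1), and the names `thompson_sampling` and `means` (with `means[m][i]` the mean of action i under model m) are hypothetical.

```python
import random


def thompson_sampling(means, prior, true_model, horizon, seed=0):
    """Run TS on a finite model set with Bernoulli rewards (illustrative).

    means[m][i] is the mean reward of action i under model m.
    Returns the final posterior and the cumulative expected regret.
    """
    rng = random.Random(seed)
    posterior = list(prior)
    n_models = len(means)
    n_actions = len(means[0])
    best_mean = max(means[true_model])
    regret = 0.0
    for _ in range(horizon):
        # Draw a model from the current posterior, then play its best action.
        sampled = rng.choices(range(n_models), weights=posterior)[0]
        action = max(range(n_actions), key=lambda i: means[sampled][i])
        # The reward is generated by the true model.
        reward = 1 if rng.random() < means[true_model][action] else 0
        regret += best_mean - means[true_model][action]
        # Bayes rule with Bernoulli likelihoods of the observed reward.
        lik = [m[action] if reward else 1.0 - m[action] for m in means]
        weights = [p * l for p, l in zip(posterior, lik)]
        total = sum(weights)
        posterior = [w / total for w in weights]
    return posterior, regret
```

For instance, `thompson_sampling([[0.6, 0.5], [0.4, 0.5]], [0.1, 0.9], 0, 200)` runs a two-action, two-model instance in which the true model starts with prior mass 0.1.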


Note that in this paper, we do not impose any continuity structure on the reward
distributions $\nu_i(\theta)$ with respect to $\theta \in \Theta$. Therefore, it
is easy to see that when $\Theta$ is uncountable, the (frequentist) regret of
Thompson Sampling, as defined in Equation (1), in the worst-case scenario is
linear in time under most underlying models $\theta \in \Theta$.




Two remarks are in order. First, the setup above is a discretized version of
rather general bandit problems. For example, the $K$-armed bandit is a special
case, where $\Theta$ is the Cartesian product of the sets of reward
distributions of all arms. As another example, in linear bandits, $\Theta$ is a
set of candidate coefficient vectors that determine the expected reward
function. Discretization of $\Theta$ provides a convenient yet useful
approximation that simplifies exposition and analysis. Such an abstract
formulation is analogous to the expert setting widely studied in the
online-learning literature; also see a recent study of Thompson Sampling with 2
and 3 experts.

Second, although we assume rewards are bounded, some results in the paper,
especially Lemma 1 that may be of independent interest, still hold with
unbounded rewards.

1.2 Related Work 

Recently, Thompson Sampling has gained a lot of interest, largely due to its
empirical successes. Furthermore, this strategy is often easy to combine with
complex reward models and easy to implement. While asymptotic, no-regret
results are known, these empirical successes inspired finite-time analyses that
deepen our understanding of this old strategy.

For the classic $K$-armed bandits, regret bounds comparable to those of the more
widely studied UCB algorithms are obtained, matching a well-known
asymptotic lower bound. For linear bandits of dimension $d$, an $O(d\sqrt{TK})$
upper bound has been proved. All these bounds, while providing interesting
insights about the algorithm, assume non-informative priors (often uniform
priors), and essentially show that Thompson Sampling has a comparable regret
to other popular strategies, especially those based on upper confidence bounds.
Unfortunately, the bounds do not show what role the prior plays in the
performance of the algorithm. In contrast, a variant of Thompson Sampling is
proposed, with a bound that depends explicitly on the entropy of the prior [53].
However, that bound has an $O(T^{2/3})$ dependence on $T$ that is likely
sub-optimal.

Another line of work in the literature focuses on the Bayes regret with an
informative prior. Previous work has shown that, for any prior in the
two-armed case, TS is a 2-approximation to the optimal strategy that minimizes
the "stochastic" (Bayes) regret [17]. It has also been shown that in the
$K$-armed case, the Bayes regret of TS is always upper bounded by
$O(\sqrt{KT})$ for any prior. These results were later improved to a
prior-dependent bound $O(\sqrt{H(q)KT})$, where $q$ is the prior distribution of
the optimal action, defined as
$q(i) = \mathbb{P}_{\theta \sim p_1}(i = \arg\max_{j \in \mathcal{A}} \mu_j(\theta))$,
and $H(q) = -\sum_{i} q(i) \log q(i)$ is the entropy of $q$. While this bound
elegantly quantifies, in terms of averaged regret, how Thompson Sampling
exploits prior distributions, it does not tell how well Thompson Sampling works
in individual problems. Indeed, in the analysis of Bayes regret, it is unclear
what a "good" prior means from a theoretical perspective, as the definition of
Bayes regret essentially assumes the prior is correctly specified. In the
extreme case where the prior $p_1$ is a point mass, $H(q) = 0$
and the Bayes regret is trivially 0.
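
To make the quantities $q$ and $H(q)$ concrete, the following sketch computes them for a finite model set. It is an illustration only: the function names and the `means[m][i]` encoding (mean of action i under model m) are hypothetical, and ties in the argmax are broken towards the smallest action index.

```python
import math


def optimal_action_distribution(means, prior):
    """q(i): prior probability that action i is optimal."""
    n_actions = len(means[0])
    q = [0.0] * n_actions
    for model_means, mass in zip(means, prior):
        best = max(range(n_actions), key=lambda i: model_means[i])
        q[best] += mass
    return q


def entropy(q):
    """Shannon entropy H(q) in nats, with 0 * log 0 treated as 0."""
    return -sum(x * math.log(x) for x in q if x > 0)
```

With a point-mass prior, `q` is itself a point mass and the entropy vanishes, which matches the extreme case mentioned above; a uniform prior over two models with different optimal actions gives entropy log 2.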



To the best of our knowledge, our work is the first to consider the frequentist
regret of Thompson Sampling with an informative prior. Specifically, we focus
on understanding TS's sensitivity to the choice of prior, making progress towards
a better understanding of such a popular Bayesian algorithm. It is shown that,
while a strong prior can lower the Bayes regret substantially, such a benefit
comes with a cost: if the true model happens to be assigned a low prior
probability (the poor-prior case), the frequentist regret will be very large,
which is consistent with a recent result on the Pareto regret frontier [22]. Our
findings suggest Thompson Sampling can be under-exploring in general. Techniques
like those in the "mini-monster" algorithm [5] may be necessary to modify
Thompson Sampling to make it less prior-sensitive. It is an open question
whether such modified Thompson Sampling algorithms can still take advantage of
an informative prior to enjoy a small Bayes regret.

Finally, our analysis makes critical use of a certain martingale property of
Thompson Sampling. Although martingales have been applied to hypothesis
testing, for example, in analyzing the statistical behavior of likelihood ratios
[3], our use of martingales to analyze the behavior of posteriors in TS is new,
to the best of our knowledge. Moreover, a different martingale property was used
by other authors to study the Bayesian multi-armed bandit problem, where
the reward at the current "state" is the same as the expected reward over the
distribution of the next state when a play is made in the current state.
Their martingale property is different from ours: their martingales apply to the
reward at the current state, while ours refers to the inverse of the posterior
probability mass of the true model (see Section 3 for details).

2 Main Results 

Naturally, we expect the regret of Thompson Sampling to be small when the true 
reward-generating model is given a large prior probability mass, and vice versa. 
An interesting and important question is to understand the sensitivity of the 
algorithm’s regret to the prior it takes as input. We take a minimalist approach, 
and investigate a special yet meaningful case. Our results fully characterize the 
worst-case dependence of TS’s regret on the prior, which also provides important 
insights into a more general case as a corollary. Furthermore, our analysis appears 
novel to the best of our knowledge, making heavy use of martingale techniques 
to analyze the behavior of the posterior probability. Such techniques may be 
useful for studying other bandit algorithms. 

Similar to the expert setting, we assume access to a set of candidate models,
$\Theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ for $N \ge 2$. This setting is
referred to as K-Actions-And-N-Models, where $K$ is the cardinality of the action
set. For simplicity, in this work, we restrict ourselves to the binary action
case: $K = 2$. Finally, the special case with $N = 2$ and $K = 2$ is called
2-Actions-And-2-Models.

Two comments are in order. First, our goal in this work is not to solve
these specialized bandit problems, but rather to understand prior sensitivity of
TS. Such seemingly simplistic problems happen to be nontrivial enough to be
useful in our constructive proof of matching lower bounds. Second, we aim to
understand TS's prior sensitivity without making any structural assumptions
about $\Theta$. A natural next step of this work is to investigate, with a
structured $\Theta$ (e.g., linear), how robust TS is to the prior.

Our upper-bound analysis requires the following smoothness assumption on
the likelihood functions of models in $\Theta$. Note that this assumption is
needed only in the upper-bound analysis, but not in the lower-bound proofs.

Assumption 1 (Smoothness) There exists a constant $s > 1$ such that
$\nu$-almost surely, for $i \in \{1, 2\}$,
$s^{-1} \cdot \ell_i(\theta_1) \le \ell_i(\theta_2) \le s \cdot \ell_i(\theta_1)$.

Remark 1. While this assumption does not hold for all distributions, it holds
for some important ones, such as Bernoulli distributions $\mathrm{Bern}(p)$ with
mean $p \in (0,1)$. On one hand, the assumption essentially avoids situations
where a single application of Bayes rule can change posteriors by too much,
analogous to bounded gradients or rewards in most online-learning literature. On
the other hand, a small $s$ value in the assumption tends to create hard
problems for Thompson Sampling, since models are less distinguishable.
Therefore, the assumption does not trivialize the problem.
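
As a small illustration of the smoothness assumption for Bernoulli rewards, the sketch below computes the smallest ratio bound over both actions and both outcomes for a pair of models. The function name and argument layout are hypothetical; the inputs are the per-action means under each model, assumed to lie strictly in (0, 1).

```python
def bernoulli_smoothness(means_theta1, means_theta2):
    """Smallest s >= 1 with l_i(theta1)/s <= l_i(theta2) <= s*l_i(theta1)
    for every action i and every outcome x in {0, 1}."""
    s = 1.0
    for a, b in zip(means_theta1, means_theta2):
        # Likelihoods of outcome 1 are (a, b); of outcome 0 are (1-a, 1-b).
        for la, lb in ((a, b), (1.0 - a, 1.0 - b)):
            s = max(s, la / lb, lb / la)
    return s
```

For means (0.6, 0.5) under one model and (0.4, 0.5) under the other, the value is about 1.5; as the means approach 0 or 1 the value blows up, which is the situation the assumption rules out.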

The first main result of this paper is the following upper bound; see Section 4
for more details:

Theorem 1. Consider the 2-Actions-And-2-Models case and assume that
Assumption 1 holds. Then, the regret of Thompson Sampling with prior $p_1$
satisfies $R_T(\theta_1, TS(p_1)) = O\big(s\sqrt{T / p_1(\theta_1)}\big)$.
Moreover, when $p_1(\theta_1)$ is sufficiently close to 1, we have
$R_T(\theta_1, TS(p_1)) = O\big(s\sqrt{(1 - p_1(\theta_1))T}\big)$.

Remark 2. The above upper bounds have the same dependence on $T$ and
$p_1(\theta_1)$ as the lower bounds to be given in Theorems 2 and 3 below.
Moreover, both bounds are increasing functions of the smoothness parameter $s$.
Because problems with small $s$ tend to be harder for Thompson Sampling, our
upper bounds are tight up to a universal constant for a fairly general class of
hard problems. We conjecture that the dependence on $s$ is an artifact of our
proof techniques and can be removed to get tighter upper bounds for all problem
instances of the 2-Actions-And-2-Models case.

The next two theorems give matching lower bounds for the poor- and good-prior
cases, respectively. More details are given in Section 5.

Theorem 2. Consider the 2-Actions-And-2-Models case. Let pi he a prior dis¬ 
tribution and T > . Consider the following specific problem instance: 

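$\nu_1(\theta_1) = \mathrm{Bern}(\tfrac{1}{2} + \Delta)$,
$\nu_1(\theta_2) = \mathrm{Bern}(\tfrac{1}{2} - \Delta)$,
$\nu_2(\theta_1) = \nu_2(\theta_2) = \mathrm{Bern}(\tfrac{1}{2})$,
where $\Delta = \frac{1}{\sqrt{8 p_1(\theta_1) T}}$. Then, the regret of Thompson
Sampling with prior $p_1$ satisfies the following: if
$p_1(\theta_1) \le \tfrac{1}{2}$, then
$R_T(\theta_1, TS(p_1)) \ge \frac{1}{168\sqrt{2}} \sqrt{\frac{T}{p_1(\theta_1)}}$.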









Theorem 3. Consider the 2-Actions-And-2-Models case. Let pi be a prior 
distribution and T > ■ Consider the following specific problem in¬ 

stance with Bernoulli reward distributions: r'liOi) = *^ 1 (^ 2 ) = Bern(^^^, 
U2{9 i) = Bern{\- A), U2{92) = Bern{\+ A), where A = s.(i-pl{9iW ' 
Then the regret of Thompson Sampling with prior pi satisfies Rt(0i, TS'(pi)) > 
5i^V(i-Pi(»i))r. 


The lower bounds in the 2-Actions-And-2-Models case easily imply the lower 
bounds in the general case. 

Corollary 1. (General Lower Bounds) Consider the case with two actions and
an arbitrary countable $\Theta$. Let $p_1$ be a prior over $\Theta$ and
$\theta^* \in \Theta$ be the true model. Then, there exist problem instances
where the regrets of Thompson Sampling are
$\Omega\big(\sqrt{T / p_1(\theta^*)}\big)$ and
$\Omega\big(\sqrt{(1 - p_1(\theta^*))T}\big)$ for small $p_1(\theta^*)$ and large
$p_1(\theta^*)$, respectively.


Remark 3. These lower bounds show that the performance of Thompson Sam¬ 
pling can be quite sensitive to the choice of input prior, especially when the prior 
is poorly chosen. 
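
To get a feel for how differently the two regimes scale, one can evaluate the two rates $\sqrt{T/p}$ and $\sqrt{(1-p)T}$ numerically. The sketch below ignores all constants, so the numbers indicate orders of magnitude only; the function names are hypothetical.

```python
import math


def poor_prior_rate(horizon, prior_mass):
    """sqrt(T / p): order of the regret when the true model has small mass p."""
    return math.sqrt(horizon / prior_mass)


def good_prior_rate(horizon, prior_mass):
    """sqrt((1 - p) T): order of the regret when p is close to 1."""
    return math.sqrt((1.0 - prior_mass) * horizon)
```

With T = 10000, a prior mass of 0.01 on the true model gives a rate of about 1000, whereas a prior mass of 0.99 gives a rate of about 10.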

Due to space limits, we can only include the more important, novel or
challenging parts of the analysis in the paper. A complete proof, together with
simulation results corroborating our theoretical findings, is given in a full
version [24].


2.1 Comparison to Previous Results 

Note that an upper bound in the K-Actions-And-N-Models case can be derived
from an earlier result which upper-bounds the Bayes regret, $\bar{R}_T(TS(p_1))$:

$$R_T(\theta_1, TS(p_1)) \le \frac{\bar{R}_T(TS(p_1))}{p_1(\theta_1)} = O\left( \frac{\sqrt{H(q)KT}}{p_1(\theta_1)} \right),$$

where $\theta_1 \in \Theta$ is the unknown, true model. On one hand, in the
2-Actions-And-2-Models case, the above upper bound becomes
$O\Big(\sqrt{\log\big(\tfrac{1}{p_1(\theta_1)}\big) \tfrac{T}{p_1(\theta_1)}}\Big)$
for small $p_1(\theta_1)$, and
$O\Big(\sqrt{\log\big(\tfrac{1}{1 - p_1(\theta_1)}\big) (1 - p_1(\theta_1)) T}\Big)$
for large $p_1(\theta_1)$. Our upper bounds in Theorem 1 remove the extraneous
logarithmic terms in these upper

bounds. On the other hand, the above general upper bound can be further upper 

bounded by O for small pi(6»i) and O log (1 “ 

for large $p_1(\theta_1)$. We conjecture that these general upper bounds can be
improved to match our lower bounds in Corollary 1, especially for small
$p_1(\theta_1)$. But it remains open how to extend our proof techniques for the
2-Actions-And-2-Models case to get tight general upper bounds.





















It is natural to compare Thompson Sampling to exponentially weighted
algorithms, a well-known family of algorithms that can also take advantage
of prior knowledge. If we see each model $\theta \in \Theta$ as an expert who
recommends the optimal action based on distributions specified by $\theta$, and
use the prior $p_1$ as the initial weights assigned to the experts, then the
EXP4 algorithm has a regret of
$O\big(\gamma KT + \tfrac{1}{\gamma} \log \tfrac{1}{p_1(\theta^*)}\big)$, with a
parameter $\gamma \in (0,1)$.
For the sake of simplicity, we only do the comparison in the
2-Actions-And-2-Models case. By trying to match or even beat the upper bounds in
Theorem 1, we reach the choice $\gamma = \sqrt{H(p_1)/T}$. Assuming that
$\theta_1$ is the true model, the bound becomes
$O\Big(\sqrt{\log\big(\tfrac{1}{p_1(\theta_1)}\big) \tfrac{T}{p_1(\theta_1)}}\Big)$
for small $p_1(\theta_1)$, and
$O\Big(\sqrt{\log\big(\tfrac{1}{1 - p_1(\theta_1)}\big) (1 - p_1(\theta_1)) T}\Big)$
for large $p_1(\theta_1)$. Thus, although EXP4 is
not a Bayesian algorithm, it has the same worst-case dependence on the prior as
Thompson Sampling, up to logarithmic factors. This is partly explained by the
fact that such algorithms are designed to perform well in the worst-case
(adaptive adversarial) scenario. On the contrary, by design, Thompson Sampling
takes advantage of prior information more efficiently in most cases, especially
when there is certain structure on the model space $\Theta$ [9]. Note that in
this paper, we do not impose any structure on $\Theta$, thus our lower bounds do
not contradict existing results in the literature with non-informative priors
(where $p_1(\theta^*)$ can be very small as $\Theta$ is typically large).

Finally, our proof techniques are new in the Thompson Sampling literature,
to the best of our knowledge. The key observation is that the inverse of the
posterior probability of the true underlying model is a martingale (Lemma 1).
It allows us to use results and techniques from martingale theory to quantify
the time and probability that the posterior distribution hits a certain threshold.
Then, the regret of Thompson Sampling can be analyzed separately before and
after hitting times.


3 Preliminaries 


In this section, we study a fundamental martingale property of Thompson
Sampling and its implications. The results are essential to proving our upper
bounds in Section 4. Note that a similar property holds for posterior updates
using Bayes rule, which however does not involve action selection.

Throughout this paper, for a random variable $Y$, we will use the shorthand
$\mathbb{E}_t[Y]$ for the conditional expectation $\mathbb{E}[Y \mid H_t]$.
Moreover, we denote by $\mathbb{E}^{\theta}[Y]$ the expectation of $Y$ when
$\theta$ is the true underlying model, i.e., when $X_{i,t}$ has
distribution $\nu_i(\theta)$. The notation $\mathbb{P}^{\theta}[\cdot]$ is
similarly defined. Furthermore, we use the shorthand $a \wedge b$ for
$\min\{a, b\}$.

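Lemma 1. (Martingale Property) Assume that $\Theta$ is countable and that
$\theta^* \in \Theta$ is the true reward-generating model. Then, the stochastic
process $(p_t(\theta^*)^{-1})_{t \ge 1}$ is a martingale with respect to the
filtration $(H_t)_{t \ge 1}$.

Proof. First, recall that conditioned on $H_t$, $p_t$ is deterministic. Then one
has

$$\mathbb{E}^{\theta^*}_t\big[p_{t+1}(\theta^*)^{-1}\big]
= \sum_{i=1}^{K} \mathbb{P}_t(I_t = i) \int \frac{\sum_{\eta \in \Theta} p_t(\eta)\, \ell_i(\eta)(x)}{p_t(\theta^*)\, \ell_i(\theta^*)(x)}\, \ell_i(\theta^*)(x)\, d\nu(x)$$
$$= \sum_{i=1}^{K} \mathbb{P}_t(I_t = i)\, p_t(\theta^*)^{-1} \sum_{\eta \in \Theta} p_t(\eta) \int \ell_i(\eta)(x)\, d\nu(x)
= \sum_{i=1}^{K} \mathbb{P}_t(I_t = i)\, p_t(\theta^*)^{-1} = p_t(\theta^*)^{-1},$$

where the second last equality follows from the fact that
$\int \ell_i(\eta)(x)\, d\nu(x) = 1$ for any $\eta \in \Theta$.
□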
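
The martingale property can be checked exactly on a small instance by enumerating the action drawn by Thompson Sampling and the reward drawn from the true model. The sketch below does this for Bernoulli rewards on a finite model set; it is an illustration under those assumptions, with hypothetical names, and `means[m][i]` denotes the mean of action i under model m.

```python
def expected_inverse_posterior(means, posterior, true_model):
    """Exact one-step expectation of 1 / p_{t+1}(theta*) under TS."""
    n_actions = len(means[0])
    total = 0.0
    for sampled, p_sampled in enumerate(posterior):
        # TS plays the best action of the sampled model.
        action = max(range(n_actions), key=lambda i: means[sampled][i])
        for reward in (0, 1):
            lik = [m[action] if reward else 1.0 - m[action] for m in means]
            p_reward = lik[true_model]  # reward comes from the true model
            norm = sum(p * l for p, l in zip(posterior, lik))
            new_p = posterior[true_model] * lik[true_model] / norm
            total += p_sampled * p_reward / new_p
    return total
```

For any posterior with positive mass on the true model, the returned value agrees with `1 / posterior[true_model]` up to floating-point error, as the lemma states.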


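Consider the 2-Actions-And-2-Models case. Let $A, B \in (0,1)$ be two
constants such that $A > p_1(\theta_1) > B$. We define the following hitting
times and hitting probabilities:
$\tau_A = \inf\{t \ge 1, p_t(\theta_1) \ge A\}$,
$\tau_B = \inf\{t \ge 1, p_t(\theta_1) \le B\}$,
$q_{A,B} = \mathbb{P}^{\theta_1}(\tau_A < \tau_B)$, and
$q_{B,A} = \mathbb{P}^{\theta_1}(\tau_A > \tau_B)$. The martingale property above
implies the following results which will be used repeatedly in the proofs of our
results.

Lemma 2. Consider the 2-Actions-And-2-Models case with $\Delta > 0$, where
$\Delta$ is as defined in Section 4. Then, we have $\tau_A < +\infty$ almost
surely. Furthermore, assume that $\tau_B < +\infty$ and that there exists a
constant $\gamma > 0$ so that $p_{\tau_B}(\theta_1) \ge \gamma$ almost surely,
then

$$q_{A,B} = \frac{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B] - p_1(\theta_1)^{-1}}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}$$

and

$$q_{B,A} = \frac{p_1(\theta_1)^{-1} - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}.$$

Finally, $q_{B,A} \le \frac{B}{p_1(\theta_1)}$ and
$q_{B,A} \le \frac{1 - p_1(\theta_1)}{A - B}$.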


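Proof. We first argue that $\tau_A < +\infty$ almost surely. Define the event
$E = \{\tau_A = +\infty\}$. Under the event $E$, $p_t(\theta_1)$ is always upper
bounded by $A$ for any $t$. Thus

$$R_T(\theta_1, TS(p_1)) = \Delta \cdot \mathbb{E}^{\theta_1} \sum_{t=1}^{T} (1 - p_t(\theta_1)) \ge \Delta\, \mathbb{P}^{\theta_1}(E) (1 - A) T.$$

It follows that

$$\bar{R}_T(TS(p_1)) \ge p_1(\theta_1) R_T(\theta_1, TS(p_1)) \ge p_1(\theta_1)\, \mathbb{P}^{\theta_1}(E)\, \Delta (1 - A) T.$$

However, it was proven that the Bayes risk $\bar{R}_T(TS(p_1))$ is always upper
bounded by $O(\sqrt{T})$. Therefore we must have $\mathbb{P}^{\theta_1}(E) = 0$;
that is, $\tau_A < +\infty$ almost surely. This implies that
$p_{\tau_A \wedge \tau_B}(\theta_1)$ is well defined and $q_{A,B} + q_{B,A} = 1$.

Now, by Lemma 1, $(p_t(\theta_1)^{-1})_{t \ge 1}$ is a martingale. It is easy to
verify that $\tau_A$ and $\tau_B$ are both stopping times with respect to the
filtration $(H_t)_{t \ge 1}$. Then it follows from Doob's optional stopping
theorem that for any $t$,
$\mathbb{E}^{\theta_1}[p_{t \wedge \tau_A \wedge \tau_B}(\theta_1)^{-1}] = p_1(\theta_1)^{-1}$.
Moreover, for any $t \ge 1$,
$p_{t \wedge \tau_A \wedge \tau_B}(\theta_1)^{-1} \le \gamma^{-1}$
(note that by definition, $\gamma \le B$). Hence, by Lebesgue's dominated
convergence theorem,
$\mathbb{E}^{\theta_1}[p_{t \wedge \tau_A \wedge \tau_B}(\theta_1)^{-1}] \to \mathbb{E}^{\theta_1}[p_{\tau_A \wedge \tau_B}(\theta_1)^{-1}]$
as $t \to +\infty$. Thus,

$$p_1(\theta_1)^{-1} = q_{A,B}\, \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B] + q_{B,A}\, \mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B].$$

The above equality combined with $q_{A,B} + q_{B,A} = 1$ gives the desired
expressions for $q_{A,B}$ and $q_{B,A}$. Finally, we have

$$q_{B,A} = \frac{p_1(\theta_1)^{-1} - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}
\le \frac{p_1(\theta_1)^{-1}}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B]} \le \frac{B}{p_1(\theta_1)}$$

and

$$q_{B,A} = \frac{p_1(\theta_1)^{-1} - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}{\mathbb{E}^{\theta_1}[p_{\tau_B}(\theta_1)^{-1} \mid \tau_A > \tau_B] - \mathbb{E}^{\theta_1}[p_{\tau_A}(\theta_1)^{-1} \mid \tau_A < \tau_B]}
\le \frac{p_1(\theta_1)^{-1} - 1}{B^{-1} - A^{-1}} = \frac{AB}{p_1(\theta_1)} \cdot \frac{1 - p_1(\theta_1)}{A - B} \le \frac{1 - p_1(\theta_1)}{A - B}.$$

□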


4 Upper Bounds 

In this section, we focus on the 2-Actions-And-2-Models case. We present and
prove our results on the upper bounds for the frequentist regret of Thompson
Sampling. Due to space limitations, we only sketch the proof for the poor-prior
case (first part of Theorem 1); complete proofs, including those for the
good-prior case, will appear in a long version.

We start with a simple lemma that follows immediately from Assumption 1.

Lemma 3. Under Assumption 1, regardless of either $\theta_1$ or $\theta_2$
being the true underlying model, for any $\theta \in \{\theta_1, \theta_2\}$,
$s^{-1} \cdot p_t(\theta) \le p_{t+1}(\theta) \le s \cdot p_t(\theta)$
$\nu$-almost surely.
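
The one-step bound on the posterior can be illustrated numerically for Bernoulli rewards by enumerating every action and outcome. The sketch below is an assumption-laden illustration (finite models, Bernoulli likelihoods, hypothetical names); it returns the extreme values of the ratio of the updated posterior to the current posterior.

```python
def posterior_ratio_range(means, posterior):
    """Min and max of p_{t+1}(theta) / p_t(theta) over actions, outcomes
    and models, for Bernoulli rewards with means[m][i] in (0, 1)."""
    ratios = []
    for action in range(len(means[0])):
        for reward in (0, 1):
            lik = [m[action] if reward else 1.0 - m[action] for m in means]
            norm = sum(p * l for p, l in zip(posterior, lik))
            # Bayes rule gives p_{t+1}(theta) / p_t(theta) = lik / norm.
            ratios.extend(l / norm for l in lik)
    return min(ratios), max(ratios)
```

For the instance with means (0.6, 0.5) and (0.4, 0.5), where the likelihood ratios are bounded by s = 1.5, the returned range stays within [1/1.5, 1.5] for any posterior.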









The next lemma describes how the posterior probability mass of the true 
model evolves over time. It can be proved by direct, although a bit tedious, 
calculations. 


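Lemma 4. Consider the 2-Actions-And-2-Models case. We have the following
inequalities concerning various functionals of the stochastic process
$(p_t(\theta_1))_{t \ge 1}$.

(a) For $t \ge 1$,
$$\mathbb{E}^{\theta_1}_t\big[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\big] \ge \sum_{i \in \{1,2\}} p_t(\theta_i)\, \frac{p_t(\theta_2)^2}{2}\, |\mu_i(\theta_1) - \mu_i(\theta_2)|^2.$$

(b) For $t \ge 1$, $\mathbb{E}^{\theta_1}[p_{t+1}(\theta_1)] \ge \mathbb{E}^{\theta_1}[p_t(\theta_1)]$ and
$$\mathbb{E}^{\theta_1}_t\big[p_{t+1}(\theta_1) - p_t(\theta_1)\big] \le \sum_{i \in \{1,2\}} p_t(\theta_i)\, p_t(\theta_1)\, p_t(\theta_2)\, \mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right].$$

(c) For $t \ge 1$,
$$\mathbb{E}^{\theta_1}_t\big[(1 - p_{t+1}(\theta_1))^{-1} - (1 - p_t(\theta_1))^{-1}\big] = \sum_{i \in \{1,2\}} p_t(\theta_i)\, \frac{p_t(\theta_1)}{p_t(\theta_2)}\, \mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right]$$
$$\ge \frac{p_t(\theta_1)^2}{2 p_t(\theta_2)}\, |\mu_1(\theta_1) - \mu_1(\theta_2)|^2 + \frac{p_t(\theta_1)}{2}\, |\mu_2(\theta_1) - \mu_2(\theta_2)|^2.$$

(d) $R_T(\theta_1, TS(p_1)) \le \Delta T (1 - p_1(\theta_1))$.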


We now introduce some notation. Let $\Delta = \mu_1(\theta_1) - \mu_2(\theta_1)$,
$\Delta_1 = |\mu_1(\theta_1) - \mu_1(\theta_2)|$ and
$\Delta_2 = |\mu_2(\theta_1) - \mu_2(\theta_2)|$. Obviously,
$\Delta \le \Delta_1 + \Delta_2$. We assume $\Delta > 0$ to avoid the degenerate
case. To simplify notation, define the regret function $R_T(\cdot)$ by
$R_T(p_1(\theta_1)) = R_T(\theta_1, TS(p_1))$. Since the immediate regret of each
step is at most $\Delta$, we immediately have $R_T(p_1(\theta_1)) \le \Delta T$.
Furthermore, we have the following useful and intuitive monotone property, which
can be proved by a dynamic-programming argument inspired by previous work
(Section 3 therein).

Lemma 5. $R_T$ is a decreasing function of $p_1(\theta_1)$.

The proofs of the upper bounds rely on several propositions that reveal
interesting recursions of Thompson Sampling's regret as a function of the prior.
Although these propositions use similar analytic techniques, they differ in many
important details. Due to space limitations, we only sketch the proof of
Proposition 1.


Proposition 1. Consider the 2-Actions-And-2-Models case and assume that 
Assumption^ holds. Then for any T > 0 and pi{9i) G (0,1), we have 

Rr(p.(e,))< (961og|+6)y^ + R^(l) . 

Proof (Sketch). We recall that 9i is assumed to be the true reward-generating 
model in the proposition, and use the same notation as in LemmaFirst, the 
desired inequality is trivial if pi{9i) > | since Rt(’) is a decreasing function. 

Moreover, if Z\ < then R'r(pi( 6 'i)) < AT < which com¬ 
pletes the proof. Thus, we can assume that pi{9i) < | and Z\ > Let 

A = |pi(0i) and B = Then, it is easy to see that B < \pi{9i) < 

\<l-A. 





















Now, the first step is to upper bound A tb — 1]. By Lemma 

have for t A tb — 1 that, 

[log(pt(6'i)"^) - log(pt+i(6'i)"^)] > ]^pt{ 0 i)pt{ 02 f^l + ]^Pt{ 02 f‘^ 


we 


> 


(Al + Al) > 


Pt{d2)^B 2 I a2 


In 


other words, (^log(pt(6»i) 


2 ^ - 16 
is a supermartingale. Applying 


t<TA^TB 

Doob’s optional stopping theorem to the stopping times cti = t A ta /\ tb and 
(72 = 1 and letting t —> +oo by using Lebesgue’s dominated convergence theorem 
and the monotone convergence theorem, we have 


[ta a tb - 1] < 


< 


16 


log 


sA 


16 

16 




log 


Pta/\Tb (^l) 

^ 1 ( 6 * 1 ) 


BA^ "’pi( 0 i) 


1 3s 
SA 2 2 ’ 


where we have used Lemma in the second last step. 

Next, the regret of Thompson Sampling can be decomposed as follows 


Rb(pi( 0 i)) 

=A ■ E®1 [ta a Tb - 1] + qB,A ■ E®^[R-rbrs iOi))\TA > Tb] 
+ qA,B ■ E®i[RT(pT^(6»i))|rA < tb] 


16 , 3s B 


1 , 3s 

= ( 16 log y + 1 


AT + Rt i^Piidi) 


Pii^i) 



where in the second last step, we have used the facts that qB,A < pifsi) 
Lemma 2 ), Pta(^i) A A = |pi(0i), and Rt(-) is a decreasing function (by 
Lemma 5). Because the above recurrence inequality holds for all pi(0i) < 
simple calculations lead to the desired inequality. □ 


Using similar proof techniques, one can prove the following recursion: 

Proposition 2. Consider the 2-Actions-And-2-Models case and assume that 
Assumption^ holds. Then, for any T > 0 and pi{9i) < we have 

With the technical lemmas and propositions developed so far, we are now
ready to prove the first upper bound of Theorem 1 for small $p_1(\theta_1)$. The
second bound for large $p_1(\theta_1)$ can be proved in a similar fashion,
although the details are quite different [24].


















Proof (of the first part in Theorem^. For convenience, define /3 = 96 log ^ + 6. 
By Propositions [2 and 

RrQ) <(144s + l)^/T+iRT(^l) 

< (144s + 1)Vt + + ^Rt Q) . 

Therefore, 

Rt < ( 288 s + /S^Gs + 2 ) Vf. 

Using again Propositionone has for any pi{9i) S (0,1), 


Rt(pi(6'i)) < Pi 




< 


< 


< 


f^\ + f288s + pV^ +2]Vf 

+ 2 ) 

["288s + /3('\/6s + 1) + 2^1 A — 

^ 2 ' y Pi (01 


< 1490si 


Pi(^'i) 


where the last step follows from the inequalities P = 96 log ^ + 6 < 300-\/s and 
■\/6s + 1 < 4-y/s for s > 1. □ 


5 Lower Bounds 

In this section, we give a proof for the lower bound when the prior is poor
(Theorem 2); the other case (Theorem 3) is left in the long version [53]. The
following technical lemma is needed, which can be proved by direct calculations:

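Lemma 6. Let $-\sqrt{1/8} \le \Delta \le \sqrt{1/8}$. Let $\ell_1$ and $\ell_2$
be the density functions of the Bernoulli distributions
$\mathrm{Bern}(\tfrac{1}{2} + \Delta)$ and $\mathrm{Bern}(\tfrac{1}{2} - \Delta)$
with respect to the counting measure on $[0,1]$. Then

$$\mathbb{E}_{X \sim \mathrm{Bern}(\frac{1}{2} + \Delta)}\left[\frac{\ell_1(X)}{\ell_2(X)} - 1\right] \le 32 \Delta^2.$$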
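
The expectation in this lemma has a simple closed form for the two Bernoulli distributions Bern(1/2 + Δ) and Bern(1/2 - Δ), which makes a numerical sanity check easy. The sketch below assumes the bound reads 32Δ² (as the surviving "32Δ²" in the statement suggests) and that |Δ| is at most 1/√8; the function name is hypothetical.

```python
def likelihood_ratio_excess(delta):
    """E_{X ~ Bern(1/2 + delta)}[l1(X) / l2(X) - 1], where l1 and l2 are the
    probability mass functions of Bern(1/2 + delta) and Bern(1/2 - delta)."""
    hi, lo = 0.5 + delta, 0.5 - delta
    # X = 1 has probability hi and ratio hi/lo; X = 0 has probability lo
    # and ratio lo/hi.
    return hi * (hi / lo) + lo * (lo / hi) - 1.0
```

Algebraically the value equals 16Δ² / (1 - 4Δ²), which is at most 32Δ² exactly when Δ² is at most 1/8.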

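Proof (of Theorem 2). Let $A = \tfrac{3}{2} p_1(\theta_1)$. Clearly,
$A \le \tfrac{3}{4}$. Recall that $\tau_A = \inf\{t \ge 1, p_t(\theta_1) \ge A\}$.
Using Lemma 4(b) and Lemma 6, one has for $t \le \tau_A - 1$,

$$\mathbb{E}^{\theta_1}_t\big[p_{t+1}(\theta_1) - p_t(\theta_1)\big]
\le \sum_{i \in \{1,2\}} p_t(\theta_i)\, p_t(\theta_1)\, p_t(\theta_2)\, \mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right]
\le 32 A^2 \Delta^2 = 72\, p_1(\theta_1)^2 \Delta^2.$$

Therefore, $\big(p_t(\theta_1) - 72\, p_1(\theta_1)^2 \Delta^2\, t\big)_{t \le \tau_A}$
is a supermartingale. Now, using Doob's optional stopping theorem, one has

$$\mathbb{E}^{\theta_1}\big[p_{t \wedge \tau_A \wedge T}(\theta_1) - (t \wedge \tau_A \wedge T)\, 72\, p_1(\theta_1)^2 \Delta^2\big] \le p_1(\theta_1) - 72\, p_1(\theta_1)^2 \Delta^2$$

for any $t \ge 1$.

Moreover, using Lebesgue's dominated convergence theorem and the monotone
convergence theorem,

$$\mathbb{E}^{\theta_1}\big[p_{t \wedge \tau_A \wedge T}(\theta_1) - (t \wedge \tau_A \wedge T)\, 72\, p_1(\theta_1)^2 \Delta^2\big]
\to \mathbb{E}^{\theta_1}\big[p_{\tau_A \wedge T}(\theta_1) - (\tau_A \wedge T)\, 72\, p_1(\theta_1)^2 \Delta^2\big]$$

as $t \to +\infty$. Hence,

$$\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{1}{72\, p_1(\theta_1)^2 \Delta^2}\, \mathbb{E}^{\theta_1}\big[p_{\tau_A \wedge T}(\theta_1) - p_1(\theta_1)\big].$$

On one side, if $\mathbb{P}^{\theta_1}(\tau_A \wedge T = T) \ge \tfrac{1}{21}$, then
$\mathbb{E}^{\theta_1}[\tau_A \wedge T] \ge \mathbb{P}^{\theta_1}(\tau_A \wedge T = T)\, T \ge \tfrac{T}{21}$.
On the other side, if
$\mathbb{P}^{\theta_1}(\tau_A \wedge T = \tau_A) \ge \tfrac{20}{21}$, then
$\mathbb{E}^{\theta_1}[p_{\tau_A \wedge T}(\theta_1)] \ge \mathbb{P}^{\theta_1}(\tau_A \wedge T = \tau_A)\, A \ge \tfrac{10}{7} p_1(\theta_1)$
and thus

$$\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{1}{72\, p_1(\theta_1)^2 \Delta^2} \left(\frac{10}{7} p_1(\theta_1) - p_1(\theta_1)\right) = \frac{T}{21}.$$

In both cases, we have $\mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \tfrac{T}{21}$.
Finally, one has

$$R_T(\theta_1, TS(p_1)) = \Delta\, \mathbb{E}^{\theta_1}\left[\sum_{t=1}^{T} (1 - p_t(\theta_1))\right]
\ge \Delta\, \mathbb{E}^{\theta_1}\left[\sum_{t=1}^{\tau_A \wedge T - 1} (1 - p_t(\theta_1))\right]$$
$$\ge \Delta (1 - A)\, \mathbb{E}^{\theta_1}[\tau_A \wedge T - 1] \ge \frac{\Delta T}{84} = \frac{1}{168\sqrt{2}} \sqrt{\frac{T}{p_1(\theta_1)}},$$

where we have used the fact that $1 - A \ge \tfrac{1}{4}$.
□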

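Proof (of Theorem 3). Using Lemma 4(c) and Lemma 6, one has

$$\mathbb{E}^{\theta_1}_t\big[p_{t+1}(\theta_2)^{-1} - p_t(\theta_2)^{-1}\big]
= \sum_{i \in \{1,2\}} p_t(\theta_i)\, \frac{p_t(\theta_1)}{p_t(\theta_2)}\, \mathbb{E}^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right]$$
$$= p_t(\theta_1)\, \mathbb{E}^{\theta_1}\left[\frac{\ell_2(\theta_1)(X_{2,t})}{\ell_2(\theta_2)(X_{2,t})} - 1\right] \le 32 \Delta^2.$$

Then for any $t \le T$,

$$\mathbb{E}^{\theta_1}\big[p_t(\theta_2)^{-1}\big] \le \frac{1}{1 - p_1(\theta_1)} + 32 (t - 1) \Delta^2 = \frac{1 + 4(t-1)/T}{1 - p_1(\theta_1)} \le \frac{5}{1 - p_1(\theta_1)}.$$

By Jensen's inequality, we have for any $t \le T$,
$\mathbb{E}^{\theta_1}[p_t(\theta_2)] \ge \big(\mathbb{E}^{\theta_1}[p_t(\theta_2)^{-1}]\big)^{-1} \ge \frac{1 - p_1(\theta_1)}{5}$.
Hence,

$$R_T(\theta_1, TS(p_1)) = \Delta \cdot \mathbb{E}^{\theta_1} \sum_{t=1}^{T} p_t(\theta_2) \ge \frac{\Delta T (1 - p_1(\theta_1))}{5} = \frac{1}{10\sqrt{2}} \sqrt{(1 - p_1(\theta_1)) T}.$$

□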
























6 Conclusions 


In this work, we studied an important aspect of the popular Thompson Sampling
strategy for stochastic bandits: its sensitivity to the prior. Focusing on a
special yet nontrivial problem, we fully characterized its worst-case dependence
of regret on the prior, both for the good- and bad-prior cases, with matching
upper and lower bounds. The lower bounds are also extended to a more general
case as a corollary, quantifying the inherent sensitivity of the algorithm when
the prior is poor and when no structural assumptions are made.

These results suggest a few interesting directions for future work, only four of
which are outlined here. One is to close the gap between upper and lower bounds
for the general, multiple-model case. We conjecture that a tighter upper bound
is likely to match the lower bound in Corollary 1. The second is to consider
prior sensitivity for structured stochastic bandits, where models in $\Theta$ are
related in certain ways. For example, in the discretized version of the
multi-armed bandit problem [3], the prior probability mass of the true model is
exponentially small when a uniform prior is used, but a strong frequentist
regret bound is still possible. Sensitivity analysis for such problems can
provide useful insights and guidance for applications of Thompson Sampling.
Third, it remains open whether there exists an algorithm whose worst-case regret
bounds are better than those of Thompson Sampling for any range of
$p_1(\theta^*)$, with $\theta^*$ being the true underlying model. This question
is related to the recent study of the Pareto regret frontier [22]. We
conjecture that the answer is negative, especially in the 2-Actions-And-2-Models
case. Finally, it is interesting to consider problem-dependent regret bounds that
often scale logarithmically with $T$.


Acknowledgments We thank Sebastien Bubeck and the anonymous reviewers 
Appendix to

On the Prior Sensitivity of Thompson Sampling 

A Technical Lemmas 

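A.1 Proof of Lemma 3

Proof (of Lemma 3). Without loss of generality, we assume that
$\theta = \theta_1$. Recall that

$$\frac{p_{t+1}(\theta_1)}{p_t(\theta_1)} = \frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\, \ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\, \ell_{I_t}(\theta_2)(X_{I_t,t})}
= \frac{1}{p_t(\theta_1) + p_t(\theta_2)\, \frac{\ell_{I_t}(\theta_2)(X_{I_t,t})}{\ell_{I_t}(\theta_1)(X_{I_t,t})}}.$$

Therefore, we have

$$\frac{1}{s} \le \frac{1}{p_t(\theta_1) + p_t(\theta_2)\, s} \le \frac{p_{t+1}(\theta_1)}{p_t(\theta_1)} \le \frac{1}{p_t(\theta_1) + p_t(\theta_2)\, \frac{1}{s}} \le s,$$

which completes the proof.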


□ 


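A.2 Proof of Lemma 4

The proof of Lemma 4, key to the upper-bound analysis, relies on the following
result:

Lemma 7. Let $\alpha \in [0,1]$. Let $\nu_1$ and $\nu_2$ be two probability
distributions on $[0,1]$ with mean $\mu_1$ and $\mu_2$. Then we have

$$KL(\nu_1, \alpha \nu_1 + (1 - \alpha) \nu_2) \ge \frac{(1 - \alpha)^2}{2}\, |\mu_1 - \mu_2|^2.$$

Proof. Let $\nu_1$ and $\nu_2$ be absolutely continuous with respect to some
measure $\nu$ with density functions $\ell_1$ and $\ell_2$. On one hand, by
Pinsker's inequality, we have

$$KL(\nu_1, \alpha \nu_1 + (1 - \alpha) \nu_2) \ge \frac{1}{2} \left( \int_0^1 |\ell_1(x) - \alpha \ell_1(x) - (1 - \alpha) \ell_2(x)|\, d\nu(x) \right)^2
= \frac{(1 - \alpha)^2}{2} \left( \int_0^1 |\ell_1(x) - \ell_2(x)|\, d\nu(x) \right)^2.$$

On the other hand,

$$|\mu_1 - \mu_2| = \left| \int_0^1 (\ell_1(x) x - \ell_2(x) x)\, d\nu(x) \right| \le \int_0^1 |\ell_1(x) - \ell_2(x)|\, d\nu(x),$$

which completes the proof. □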

We are now ready to prove Lemma 


□ 
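As a numerical sanity check of the inequality in Lemma 7, the sketch below evaluates both sides for Bernoulli distributions, where a mixture of two Bernoulli distributions is again Bernoulli. The grid of means and mixing weights, and the function names, are illustrative assumptions.

```python
import math


def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lemma7_gap(mu1, mu2, alpha):
    """KL(nu1, alpha*nu1 + (1-alpha)*nu2) minus (1-alpha)^2 / 2 * |mu1 - mu2|^2."""
    mix = alpha * mu1 + (1.0 - alpha) * mu2  # mean of the Bernoulli mixture
    return kl_bernoulli(mu1, mix) - 0.5 * (1.0 - alpha) ** 2 * (mu1 - mu2) ** 2


grid = [k / 20.0 for k in range(1, 20)]    # means strictly inside (0, 1)
alphas = [k / 10.0 for k in range(0, 11)]  # mixing weights in [0, 1]
worst = min(lemma7_gap(a, b, al) for a in grid for b in grid for al in alphas)
print(worst)
```

The smallest gap over the grid should be non-negative up to floating-point error, consistent with the lemma.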












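Proof (of Lemma 4). Recall that for the 2-Actions-And-2-Models case,

$$p_{t+1}(\theta_1) = \frac{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})},$$

$$p_{t+1}(\theta_2) = \frac{p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})}{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})},$$

and $I_t = i$ with probability $p_t(\theta_i)$ for $i \in \{1,2\}$. We carry out the following
computations to prove the lemma.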

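(a)

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\right]
&= \mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})}\right] \\
&= \sum_{i\in\{1,2\}} p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\,\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\,\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \sum_{i\in\{1,2\}} p_t(\theta_i)\,\mathrm{KL}\big(\nu_i(\theta_1),\, p_t(\theta_1)\nu_i(\theta_1) + p_t(\theta_2)\nu_i(\theta_2)\big) \\
&\ge \sum_{i\in\{1,2\}} \frac{p_t(\theta_i)\,p_t(\theta_2)^2}{2}\,|\mu_i(\theta_1) - \mu_i(\theta_2)|^2,
\end{aligned}$$

where the last step follows from Lemma 7.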

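(b)

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[(1 - p_{t+1}(\theta_1))^{-1} - (1 - p_t(\theta_1))^{-1}\right]
&= \mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_2)^{-1} - p_t(\theta_2)^{-1}\right] \\
&= \mathbb{E}_t^{\theta_1}\left[\frac{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})}{p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})} - \frac{1}{p_t(\theta_2)}\right] \\
&= \frac{p_t(\theta_1)}{p_t(\theta_2)}\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{\ell_{I_t}(\theta_2)(X_{I_t,t})} - 1\right] \\
&= \frac{p_t(\theta_1)}{p_t(\theta_2)} \sum_{i\in\{1,2\}} p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right] \\
&\ge \frac{p_t(\theta_1)}{p_t(\theta_2)} \sum_{i\in\{1,2\}} p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \frac{p_t(\theta_1)^2}{p_t(\theta_2)}\,\mathrm{KL}(\nu_1(\theta_1), \nu_1(\theta_2)) + p_t(\theta_1)\,\mathrm{KL}(\nu_2(\theta_1), \nu_2(\theta_2)) \\
&\ge \frac{p_t(\theta_1)^2}{2\,p_t(\theta_2)}\,|\mu_1(\theta_1) - \mu_1(\theta_2)|^2 + \frac{p_t(\theta_1)}{2}\,|\mu_2(\theta_1) - \mu_2(\theta_2)|^2,
\end{aligned}$$

where we have used the inequality $x - 1 \ge \log x$ and the last step follows from
Lemma 7.
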
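(c)

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&= p_t(\theta_1)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_{I_t}(\theta_1)(X_{I_t,t})}{p_t(\theta_1)\,\ell_{I_t}(\theta_1)(X_{I_t,t}) + p_t(\theta_2)\,\ell_{I_t}(\theta_2)(X_{I_t,t})} - 1\right] \\
&= \sum_{i\in\{1,2\}} p_t(\theta_1)\,p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\,\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\,\ell_i(\theta_2)(X_{i,t})} - 1\right].
\end{aligned}$$

On one hand, using the inequality $x - 1 \ge \log x$, we have

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&\ge \sum_{i\in\{1,2\}} p_t(\theta_1)\,p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[\log\frac{\ell_i(\theta_1)(X_{i,t})}{p_t(\theta_1)\,\ell_i(\theta_1)(X_{i,t}) + p_t(\theta_2)\,\ell_i(\theta_2)(X_{i,t})}\right] \\
&= \sum_{i\in\{1,2\}} p_t(\theta_1)\,p_t(\theta_i)\,\mathrm{KL}\big(\nu_i(\theta_1),\, p_t(\theta_1)\nu_i(\theta_1) + p_t(\theta_2)\nu_i(\theta_2)\big) \ge 0.
\end{aligned}$$

On the other hand, using Jensen's inequality on the convex function $x \mapsto x^{-1}$,
one has

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[p_{t+1}(\theta_1) - p_t(\theta_1)\right]
&\le \sum_{i\in\{1,2\}} p_t(\theta_1)\,p_t(\theta_i)\,\mathbb{E}_t^{\theta_1}\left[p_t(\theta_1) + p_t(\theta_2)\,\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right] \\
&= \sum_{i\in\{1,2\}} p_t(\theta_1)\,p_t(\theta_i)\,p_t(\theta_2)\,\mathbb{E}_t^{\theta_1}\left[\frac{\ell_i(\theta_1)(X_{i,t})}{\ell_i(\theta_2)(X_{i,t})} - 1\right].
\end{aligned}$$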

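(d) By definition of the regret and part (c), one has

$$R_T(\theta_1, \mathrm{TS}(p_1)) = \Delta\,\mathbb{E}^{\theta_1}\sum_{t=1}^{T} p_t(\theta_2) = \Delta\sum_{t=1}^{T}\left(1 - \mathbb{E}^{\theta_1} p_t(\theta_1)\right) \le \Delta T\,(1 - p_1(\theta_1)).$$

□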
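The one-step quantities in parts (a) and (c) can be computed exactly when rewards are Bernoulli, since each step has only four (arm, reward) outcomes. The sketch below does this for an assumed instance; the means in `MU` and all function names are hypothetical and serve only to illustrate the inequalities.

```python
import math

# mu[i][j]: hypothetical mean of arm i+1 under model theta_{j+1} (Bernoulli rewards).
MU = ((0.6, 0.4), (0.5, 0.55))


def lik(mean, x):
    """Bernoulli likelihood of outcome x in {0, 1}."""
    return mean if x == 1 else 1.0 - mean


def posterior(p1, arm, x, mu=MU):
    """p_{t+1}(theta_1) given p_t(theta_1) = p1, the played arm and the reward x."""
    a = p1 * lik(mu[arm][0], x)
    b = (1.0 - p1) * lik(mu[arm][1], x)
    return a / (a + b)


def expect_next(p1, f, mu=MU):
    """E_t^{theta_1}[f(p_{t+1}(theta_1))]: arm i is played with probability
    p_t(theta_i), and the reward is drawn from the true model theta_1."""
    total = 0.0
    for arm, p_arm in enumerate((p1, 1.0 - p1)):
        for x in (0, 1):
            total += p_arm * lik(mu[arm][0], x) * f(posterior(p1, arm, x, mu))
    return total


def part_a_lower_bound(p1, mu=MU):
    """Right-hand side of (a): sum_i p_t(theta_i) p_t(theta_2)^2 / 2 * gap_i^2."""
    probs = (p1, 1.0 - p1)
    return sum(probs[i] * (1.0 - p1) ** 2 / 2.0 * (mu[i][0] - mu[i][1]) ** 2
               for i in (0, 1))


for p1 in (0.1, 0.5, 0.9):
    drift_log = expect_next(p1, math.log) - math.log(p1)  # left-hand side of (a)
    drift_p = expect_next(p1, lambda q: q) - p1           # left-hand side of (c)
    print(p1, drift_log, part_a_lower_bound(p1), drift_p)
```

For each tested prior mass, the log-drift should dominate the lower bound of part (a), and the drift of $p_t(\theta_1)$ should be non-negative as in part (c).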


A.3 Proof of Lemma 

Proof. The proof is inspired by the dynamic-programming argument
used in Section 3 of a previous study. We assume that $\theta_1$ is the true
reward-generating model. For arm $i \in \{1,2\}$, define $R_T^{(i)}(\alpha)$ as the regret of the
policy that starts with the prior $p_1 = (\alpha, 1-\alpha)$, plays arm $i$ for the first step,
and then executes Thompson Sampling for the remaining $T - 1$ steps. It is easy
to see that

$$R_T(\alpha) = \alpha R_T^{(1)}(\alpha) + (1-\alpha) R_T^{(2)}(\alpha). \qquad (3)$$

We now prove by induction on $T$ that $R_T(\cdot)$ is a decreasing function. For the
base case of $T = 1$, $R_1(\alpha) = \Delta(1 - \alpha)$ is obviously decreasing. Now, suppose
$R_t(\cdot)$ is decreasing for all $t < T$, and we will show that $R_T(\cdot)$ is also decreasing. The
proof proceeds in three main steps.

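Step One: This step is devoted to showing that both $R_T^{(1)}$ and $R_T^{(2)}$ are decreasing
functions of $\alpha$. By definition, we have

$$R_T^{(1)}(\alpha) = \mathbb{E}_{Z\sim\nu_1(\theta_1)}\left[R_{T-1}\left(\frac{\alpha\,\ell_1(\theta_1)(Z)}{\alpha\,\ell_1(\theta_1)(Z) + (1-\alpha)\,\ell_1(\theta_2)(Z)}\right)\right],$$

$$R_T^{(2)}(\alpha) = \Delta + \mathbb{E}_{Z\sim\nu_2(\theta_1)}\left[R_{T-1}\left(\frac{\alpha\,\ell_2(\theta_1)(Z)}{\alpha\,\ell_2(\theta_1)(Z) + (1-\alpha)\,\ell_2(\theta_2)(Z)}\right)\right].$$

Since $R_{T-1}(\alpha)$ is decreasing in $\alpha \in (0,1)$, it follows that

$$R_T^{(1)}(\alpha) = \mathbb{E}_{Z\sim\nu_1(\theta_1)}\left[R_{T-1}\left(\frac{\ell_1(\theta_1)(Z)}{\ell_1(\theta_1)(Z) + (1/\alpha - 1)\,\ell_1(\theta_2)(Z)}\right)\right]$$

is a decreasing function of $\alpha$. Similarly, $R_T^{(2)}(\alpha)$ is also a decreasing function.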
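Step Two: This step is to show that the functions $R_T^{(1)}$ and $R_T^{(2)}$ satisfy

$$R_T^{(1)}(\alpha) \le R_T^{(2)}(\alpha) \qquad (4)$$

for any $T$ and $\alpha \in (0,1)$. We prove the claim by mathematical induction on
$T$. The base case where $T = 1$ is trivial, since $R_1^{(1)}(\alpha) = 0$ and $R_1^{(2)}(\alpha) = \Delta$.

Now suppose $R_t^{(1)}(\alpha) \le R_t^{(2)}(\alpha)$ for all $t < T$. Then for every $t < T$,
$R_t^{(1)}(\alpha) \le R_t(\alpha) \le R_t^{(2)}(\alpha)$, because of Equation (3) and the induction hypothesis. It follows
that

$$\begin{aligned}
R_T^{(1)}(\alpha) &= \mathbb{E}_{Z\sim\nu_1(\theta_1)}\left[R_{T-1}\left(\frac{\alpha\,\ell_1(\theta_1)(Z)}{\alpha\,\ell_1(\theta_1)(Z) + (1-\alpha)\,\ell_1(\theta_2)(Z)}\right)\right] \\
&\le \mathbb{E}_{Z\sim\nu_1(\theta_1)}\left[R_{T-1}^{(2)}\left(\frac{\alpha\,\ell_1(\theta_1)(Z)}{\alpha\,\ell_1(\theta_1)(Z) + (1-\alpha)\,\ell_1(\theta_2)(Z)}\right)\right] \\
&= \Delta + \mathbb{E}_{Z\sim\nu_1(\theta_1)}\,\mathbb{E}_{Z'\sim\nu_2(\theta_1)}\left[X^{(1)}\right],
\end{aligned}$$

where

$$X^{(1)} = R_{T-2}\left(\frac{\alpha\,\ell_1(\theta_1)(Z)\,\ell_2(\theta_1)(Z')}{\alpha\,\ell_1(\theta_1)(Z)\,\ell_2(\theta_1)(Z') + (1-\alpha)\,\ell_1(\theta_2)(Z)\,\ell_2(\theta_2)(Z')}\right),$$

and that

$$\begin{aligned}
R_T^{(2)}(\alpha) &= \Delta + \mathbb{E}_{Z\sim\nu_2(\theta_1)}\left[R_{T-1}\left(\frac{\alpha\,\ell_2(\theta_1)(Z)}{\alpha\,\ell_2(\theta_1)(Z) + (1-\alpha)\,\ell_2(\theta_2)(Z)}\right)\right] \\
&\ge \Delta + \mathbb{E}_{Z\sim\nu_2(\theta_1)}\left[R_{T-1}^{(1)}\left(\frac{\alpha\,\ell_2(\theta_1)(Z)}{\alpha\,\ell_2(\theta_1)(Z) + (1-\alpha)\,\ell_2(\theta_2)(Z)}\right)\right] \\
&= \Delta + \mathbb{E}_{Z\sim\nu_2(\theta_1)}\,\mathbb{E}_{Z'\sim\nu_1(\theta_1)}\left[X^{(2)}\right],
\end{aligned}$$

where

$$X^{(2)} = R_{T-2}\left(\frac{\alpha\,\ell_1(\theta_1)(Z')\,\ell_2(\theta_1)(Z)}{\alpha\,\ell_1(\theta_1)(Z')\,\ell_2(\theta_1)(Z) + (1-\alpha)\,\ell_1(\theta_2)(Z')\,\ell_2(\theta_2)(Z)}\right).$$

Thus, $R_T^{(2)}(\alpha) \ge R_T^{(1)}(\alpha)$ by Fubini's theorem.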

Step Three: This step finishes the induction step, based on results established
in the previous two steps. For any $0 < \alpha < \beta < 1$, we have

$$\begin{aligned}
R_T(\beta) &= \beta R_T^{(1)}(\beta) + (1-\beta) R_T^{(2)}(\beta) \\
&\le \beta R_T^{(1)}(\alpha) + (1-\beta) R_T^{(2)}(\alpha) \\
&\le \alpha R_T^{(1)}(\alpha) + (1-\alpha) R_T^{(2)}(\alpha) \\
&= R_T(\alpha),
\end{aligned}$$

where the equalities are from Equation (3), the first inequality is from the
monotonicity of $R_T^{(i)}(\cdot)$ established in Step One, and the second is from Equation (4).
We have thus proved that $R_t(\cdot)$ is a decreasing function for $t = T$, and finished
the inductive step. □
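The recursion in Equation (3) can be evaluated exactly for small horizons when rewards are Bernoulli. The sketch below does so for an assumed instance and can be used to observe the monotonicity of $R_T(\cdot)$, the ordering in Equation (4), and the bound of Lemma 4(d). The means in `MU`, the constant `DELTA` and the function names are hypothetical choices made for illustration.

```python
# mu[i][j]: hypothetical mean of arm i+1 under model theta_{j+1} (Bernoulli rewards).
MU = ((0.6, 0.4), (0.5, 0.55))
DELTA = MU[0][0] - MU[1][0]  # per-step regret of arm 2 when theta_1 is true


def lik(mean, x):
    """Bernoulli likelihood of outcome x in {0, 1}."""
    return mean if x == 1 else 1.0 - mean


def posterior(alpha, arm, x):
    """Posterior mass of theta_1 after playing `arm` and observing x."""
    a = alpha * lik(MU[arm][0], x)
    b = (1.0 - alpha) * lik(MU[arm][1], x)
    return a / (a + b)


def regret(T, alpha, first_arm=None):
    """Exact R_T(alpha) under theta_1. With first_arm = 0 or 1, the first action
    is forced, which gives R_T^{(1)}(alpha) or R_T^{(2)}(alpha)."""
    if T == 0:
        return 0.0

    def play(arm):
        total = DELTA if arm == 1 else 0.0  # instantaneous regret of the first step
        for x in (0, 1):
            total += lik(MU[arm][0], x) * regret(T - 1, posterior(alpha, arm, x))
        return total

    if first_arm is not None:
        return play(first_arm)
    return alpha * play(0) + (1.0 - alpha) * play(1)  # Equation (3)


print([round(regret(6, a), 4) for a in (0.1, 0.5, 0.9)])
```

The recursion branches four ways per step, so its cost grows as $4^T$; it is meant only as a small-horizon check, not as a way to compute regret for realistic $T$.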


A.4 Markov Property 

Another fundamental, although intuitive, property of Thompson Sampling is
that the posterior distribution it maintains over the set of models forms a Markov
process. This property is used in the proofs of multiple propositions in later
sections.

Lemma 8 (Markov Property). Regardless of the true underlying model, the
stochastic process $(p_t)_{t \ge 1}$ is a Markov process.

Proof. Let $\theta^*$ be the true underlying model. Recall that

$$p_{t+1}(\theta) = \frac{p_t(\theta)\,\ell_{I_t}(\theta)(X_{I_t,t})}{\sum_{\eta\in\Theta} p_t(\eta)\,\ell_{I_t}(\eta)(X_{I_t,t})}.$$

Note that $I_t$ is drawn from $p_t$ independently of the past, and $X_{I_t,t}$ is drawn
from $\nu_{I_t}(\theta^*)$. Hence, the distribution of $p_{t+1}$ only depends on $p_t$ and $\nu_i(\theta)$, $i = 1, \ldots, K$, $\theta \in \Theta$.
The reward distributions $\nu_i(\theta)$ are fixed before the evolution
of the process $p_t$. Thus, the distribution of $p_{t+1}$ only depends on $p_t$, not on
$p_s$, $s = 1, \ldots, t-1$. This shows that $p_t$ is a Markov process. □

B Proof of Corollary 

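Proof (of Corollary). Let $\tilde{p}_1$ be the prior over $\{\theta_1, \theta_2\}$ defined as $\tilde{p}_1(\theta_1) = p_1(\theta^*)$
and $\tilde{p}_1(\theta_2) = p_1(\Theta\setminus\{\theta^*\})$. By our lower-bound theorem for the
2-Actions-And-2-Models case, there exists a problem instance $\mathcal{P}$ (defined by
$\tilde{\nu}_i(\theta_j)$, $i, j = 1, 2$) where the regret of TS with prior $\tilde{p}_1$ is
$\Omega\big(\sqrt{T/\tilde{p}_1(\theta_1)}\big)$ for small $\tilde{p}_1(\theta_1)$. Now consider the problem instance
$\mathcal{Q}$ for the general $\Theta$ case defined as $\nu_i(\theta^*) = \tilde{\nu}_i(\theta_1)$ for $i = 1, 2$, and $\nu_i(\theta) = \tilde{\nu}_i(\theta_2)$
for $i = 1, 2$ and $\theta \in \Theta\setminus\{\theta^*\}$. It is easy to see that Thompson Sampling with
prior $p_1$ under $\mathcal{Q}$ has exactly the same regret as Thompson Sampling with prior
$\tilde{p}_1$ under $\mathcal{P}$. Thus, under $\mathcal{Q}$, the regret of Thompson Sampling with prior $p_1$
is $\Omega\big(\sqrt{T/p_1(\theta^*)}\big)$ for small $p_1(\theta^*)$. The $\Omega\big(\sqrt{(1 - p_1(\theta^*))T}\big)$ lower
bound for large $p_1(\theta^*)$ can be similarly obtained. □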


C Proof of Theorem 1


In the main text, we only sketch the proof for the first part of the theorem. Here,
a full proof is presented, which requires an additional proposition that plays a
similar role as Propositions 1 and 2. Its proof is given in Section F.


Proposition 3. Consider the 2-Actions-And-2-Models case and assume that 
Assumption \l\ holds. We also assume that A > , ^ and define the 

function Qt(’) by Qt(x) = Rt(1 — x). Then for any T > 0 and pi(02) < 
we have 


Qt(pi(^2)) - Qt(^Pi(02)) 

< 360s^\/pi(02)T + (Qt (4s^pi(02)) — Qt(pi(02))) • 

Proof (of Theorem^. 

Proof of the First Inequality: Let /3 = 96 log ^ + 6. By Propositions 111 and 

0 


Rr 



<(144. + 1)v^+1r,(1) 

< (144s + 1)Vt + + ^Rt Q) . 


Therefore, 












Using again Propositionone has for any pi{9i) G (0,1), 


Rt(pi(6'i)) < /3^ 

<P\ 




+ Rt 


Pi 


^ + f288s + /3v^ + 2)Vf 

[&i) V / 




< (288» + + 1) + 2 ) 


< 1490s4 


T 


Pi{0i) 


where the last step follows from the inequalities ^ = 96 log ^ + 6 < SOOy^ and 
+ 1 < 4-y/s for s > 1. 

Proof of the Second Inequality: Fix pi{0i) > 1 — First, if 

Lemma |4j;d), Rt(pi(6'i)) < (1 - pi(6»i))Z\T < 

= . It follows from 


\/(l — pi(di))T. Hence, we can assume that A > — 7 = 

’ - ^p^-vASi))T 

Proposition 1^ that for any integer /i > 1, as long as (4s^)^“^pi(02) < one 

has 


Qt(pi(02)) - Qt(^Pi(02)) 

^ E (^)'36Osy(4s2)Vi(02)T+ (^)"qt((4s2)Vi(02)) 
^E(n) 360s4v/p^(W+ Qt((4s2)Vi(02)) 

< 132OsVpi(02)T+ Qt((4s")Vi(02)). 

Let h be the smallest integer such that (4s^)^pi(02) > g^- On one hand, 
(4g2)?i-ipi(^2) < gU implies that 1 — (4s^)^pi(02) > Using the first in¬ 
equality of Theorem]^ and the fact that the function Rt(-) is decreasing, one 
has 


Qt((4s2)Vi(02)) = Rt(1 - (4s2)V(02)) 

< Rt(^) < 1490sv^. 






















On the other hand, ( 4 s^)^pi(d 2 ) > implies that 2\/2s^/pjJ(^ > > 

(ifs)^ ■ Hence, for pi( 6 » 2 ) < 

Qr(pi(02)) - QH^pM) < (13205^ + 596Os2)Vpi(02)T 

< 7280sVpi(«'2)T. 


Thus, for any integer m, one has 


Qt(pi( 6 ' 2 )) < ^ 7280s^^P i(^2)T + Qt 

m-l . .k , 

7280sVpi(^2)T + Rt I 1- 

i—n ^ ^ 




^ 1 X m 

< 14560sV pi(^ 2)T+ f Pi(02)^r, 

where we have used LemmaQd) in the last step. Finally, letting m go to infinity, 
we get 

Qt(pi(02 )) < 14560sV pi(^ 2 )T, that is, Rt(pi(0i)) < 14560sV(1 "Pi 

□ 


D Proof of Proposition 


Proof (of Proposition^. In this proof, we consider the case where di is the true 
reward-generating model. We use the notation defined in Lemma First, the 
desired inequality is trivial if pi( 6 *i) > g since Rt(') is a decreasing function by 
Lemma Let pi{0i) < 1, A = |pi(0i) and take B > 0 such that B < gPi(0i). 
The exact value of B will be specified later. It is easy to see that A < ^ and 
B < ^ < 1 — A. We decompose the rest of the proof into three steps. 

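Step One: This step is devoted to upper bounding $\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1]$. Note that
by the definition of $\tau_A$ and $\tau_B$, one has for $t \le \tau_A \wedge \tau_B - 1$, $B \le p_t(\theta_1) \le A \le \frac{1}{2}$
and $p_t(\theta_2) \ge 1 - A \ge \frac{1}{2} \ge B$. Thus, by Lemma 4(a), we have for $t \le \tau_A \wedge \tau_B - 1$,

$$\begin{aligned}
\mathbb{E}_t^{\theta_1}\left[\log(p_t(\theta_1)^{-1}) - \log(p_{t+1}(\theta_1)^{-1})\right]
&\ge \frac{1}{2}\,p_t(\theta_1)\,p_t(\theta_2)^2\,\Delta_1^2 + \frac{1}{2}\,p_t(\theta_2)^3\,\Delta_2^2 \\
&\ge \frac{p_t(\theta_2)^2 B}{2}\,(\Delta_1^2 + \Delta_2^2) \ge \frac{B\Delta^2}{16},
\end{aligned}$$

where $\Delta_i = |\mu_i(\theta_1) - \mu_i(\theta_2)|$ and we have used
$\Delta_1^2 + \Delta_2^2 \ge \frac{1}{2}(\Delta_1 + \Delta_2)^2 \ge \frac{1}{2}\Delta^2$. Rearranging, we get for
$t \le \tau_A \wedge \tau_B - 1$,

$$\mathbb{E}_t^{\theta_1}\left[\log(p_{t+1}(\theta_1)^{-1}) + (t+1)\,\frac{B\Delta^2}{16}\right] \le \log(p_t(\theta_1)^{-1}) + t\,\frac{B\Delta^2}{16}.$$

In other words, $\left(\log(p_t(\theta_1)^{-1}) + t\,\frac{B\Delta^2}{16}\right)_{t \le \tau_A \wedge \tau_B}$ is a supermartingale.

Now, using Doob's optional stopping theorem, one has for any $t \ge 1$,

$$\mathbb{E}^{\theta_1}\left[\log(p_{t\wedge\tau_A\wedge\tau_B}(\theta_1)^{-1}) + (t \wedge \tau_A \wedge \tau_B)\,\frac{B\Delta^2}{16}\right] \le \log(p_1(\theta_1)^{-1}) + \frac{B\Delta^2}{16}.$$

Also, by the lemma proved in Section A.1, $\log(p_{t\wedge\tau_A\wedge\tau_B}(\theta_1)^{-1}) \le \log\left(\frac{s}{B}\right)$ for any $t \ge 1$. Using
Lebesgue's dominated convergence theorem and the monotone convergence
theorem,

$$\mathbb{E}^{\theta_1}\left[\log(p_{t\wedge\tau_A\wedge\tau_B}(\theta_1)^{-1}) + (t \wedge \tau_A \wedge \tau_B)\,\frac{B\Delta^2}{16}\right] \longrightarrow \mathbb{E}^{\theta_1}\left[\log(p_{\tau_A\wedge\tau_B}(\theta_1)^{-1}) + (\tau_A \wedge \tau_B)\,\frac{B\Delta^2}{16}\right]$$

as $t \to +\infty$. Hence,

$$\mathbb{E}^{\theta_1}[\tau_A \wedge \tau_B - 1] \le \frac{16}{B\Delta^2}\,\mathbb{E}^{\theta_1}\left[\log\frac{p_{\tau_A\wedge\tau_B}(\theta_1)}{p_1(\theta_1)}\right] \le \frac{16}{B\Delta^2}\,\log\frac{sA}{p_1(\theta_1)} = \frac{16}{B\Delta^2}\,\log\frac{3s}{2},$$

where we have used the lemma proved in Section A.1 in the second last step.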

Step Two: In this step, we establish a recurrence inequality for the regret function
$R_T(\cdot)$. By Lemma 8, $(p_t(\theta_1))_{t \ge 1}$ and $(p_t(\theta_2))_{t \ge 1}$ are both Markov processes.
Thus, the regret of Thompson Sampling can be decomposed as follows
where in the last step, we have used the facts that qb.a < Lemma j^, 

Pta{(^i) > ^ and Rt(-) is a decreasing function (Lemma [^. 

Step Three: The recurrence inequality established in the previous step and an 
appropriate choice of the parameter B allow us to get the desired upper bound on 
R.t(pi(0i))- On one side, if Z\ < then Rt(pi(0i)) < AT < 2^^^. 

On the other side, if Z\ > we take B = This choice of B 





























is eligible since Then for any pi{0i) < 

KripM)) < + ^-^r + R, (^„( 0 .)) 

= (leiog^ + 1) y'^ + R^ (5p.(9,)). 

It follows that for any integer h > 1, as long as (|)^ ^Pi(^i) < one has 

Rr(p.(9.)) < E (l61og| + 1 ) ((I) 

<(96bg|+6)y^ + R^(^(^?) 

Finally, by taking h to be the smallest integer such that (|)^ Pi{0i) > 5 and 
using the fact that the function Rt(-) is decreasing (Lemma^, we get 

Rt(„( 9,)) < (961og|+6) yC^ + R,, , 

which completes the proof. □ 


E Proof of Proposition 

Proof (of Proposition^. In this proof, we consider the case where 9i is the 
true reward-generating model. We use the notation defined in Lemma Fix 
T > 0 and pi{0i) < Let B = ^pi{9i) and take A > pi{9i). The exact value 
of A will be specified later. We decompose the proof into three steps. 


Step One: This step is devoted to upper bounding A — I]. By 

Lemma l^c), we have for t < A tb — 1, 
Now, using Doob’s optional stopping theorem, one has for any t ≥ 1, 
for any t ≥ 1. Using Lebesgue’s dominated convergence theorem and the monotone
convergence theorem. 
Step Two: In this step, we establish a recurrence inequality for the regret function
$R_T(\cdot)$. By Lemma 8, $(p_t(\theta_1))_{t \ge 1}$ and $(p_t(\theta_2))_{t \ge 1}$ are both Markov processes.
Thus, the regret of Thompson Sampling can be decomposed as follows
where in the last step, we have used the facts that qB,A < pifsi) ~ 5 
Lemma [^, Ptb(^i) > f = ^Pi(^i) (by Lemma |^, RT(pr^(6'i)) < (1 - 



















(1 — A) AT (by Lemma Qd)) and Rt( •) is a decreasing function 

Step Three: Finally, we establish the desired recurrence inequality by appro¬ 
priately choosing the value of A. On one side, if Z\ < then R'r(pi(di)) < 

AT < 2\/T. On the other side, if Z\ > we take A = 1 — This choice 
F Proof of Proposition 

Proof (of Proposition^. In this proof, we consider the case where 6i is the true 
reward-generating model. We use the notation defined in Lemma Fix T > 0 
and pi{0i) > 1 - g^. Let A = 1 - 4^(1 - Pi( 6 'i)) and B = 1- 4s{l - pi{9i)). 
Then it is easy to see that A > pi{9i) > B and B > ^. The proof is decomposed 
into two steps. 

Step One: This step is devoted to upper bounding A — 1]. By 

Lemma l^c), we have for t < A tb — 1, 
Step Two: In this step, we establish the desired recurrence inequality. By
Lemma 8, $(p_t(\theta_1))_{t \ge 1}$ and $(p_t(\theta_2))_{t \ge 1}$ are both Markov processes. Thus, the
regret of Thompson Sampling can be decomposed as follows
where in last two steps, we have used the definition of A, B and the facts that 
PrAi^i) > Prsi^i) = 1 -Ptb(^ 2 ) > 1 - s(l - R) (Lemma |§ and Rt(-) is a 
decreasing function (Lemma [^. 
Therefore, we obtain the desired recurrence inequality by observing that 
□ 


In this section, we give empirical evidence that the actual regret incurred by
Thompson Sampling is consistent with what the theory predicts. In particular, we
show that the regret does indeed scale linearly with $\sqrt{1/p}$ and $\sqrt{1-p}$,
respectively, for the bad- and good-prior cases, where $p$ is the prior probability mass
of the true model.

We consider the 2-Actions-And-$N$-Models case, with Bernoulli rewards and
$\theta_1$ being the true model. When $N = 2$, the bandit problem is as described in the
theorems for the poor- and good-prior cases, respectively. However, we make the
more natural choice of fixing $\Delta = 0.05$, as opposed to making it a function of
$p$ and $T$ (as required by the theorems). Furthermore, when $N > 2$, we introduce
randomness into $\{\theta_2, \ldots, \theta_N\}$ to generate different models as follows. For the
poor-prior case, the reward is 0.5 for $a = 1$ and $0.5 - \Delta'$ for $a = 2$, where
$\Delta'$ is drawn uniformly at random from a fixed interval; the good-prior case is constructed similarly. Therefore, under
every model other than $\theta_1$, the optimal action is $a = 2$, whose per-step regret is
$\Delta$ (since $\theta_1$ is actually the true model and $a = 1$ is the true optimal action).

We place a prior probability mass $p > 0$ on $\theta_1$, and assign the rest of the
probability mass uniformly to the other $N - 1$ models. We run Thompson Sampling
with this prior, denoted $p_1$, for $T = 10000$ steps; the cumulative regret over
the $T$ steps is averaged over 2000 independent runs of the algorithm, to yield a
reliable empirical estimate of $R_T(\theta_1, \mathrm{TS}(p_1))$.
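The experimental protocol can be sketched as a small Monte Carlo simulation for the $N = 2$ case. The instance below (arm 1 with mean 0.5 under both models, arm 2 with mean $0.5 - \Delta$ under $\theta_1$ and $0.5 + \Delta$ under $\theta_2$) is an assumption made for illustration and is not necessarily the construction used in the paper; the function names, the seed, and the reduced default horizon and number of runs are likewise hypothetical.

```python
import random


def run_thompson(p, T, delta, rng):
    """One run of Thompson Sampling on an assumed 2-action, 2-model Bernoulli
    instance; returns the cumulative pseudo-regret under the true model theta_1."""
    # mu[arm][model]: arm 1 has mean 0.5 under both models; arm 2 is worse
    # under theta_1 and better under theta_2 (an assumed construction).
    mu = ((0.5, 0.5), (0.5 - delta, 0.5 + delta))
    post = p  # posterior mass p_t(theta_1)
    regret = 0.0
    for _ in range(T):
        model = 0 if rng.random() < post else 1         # sample theta_t from p_t
        arm = 0 if mu[0][model] >= mu[1][model] else 1  # optimal arm under theta_t
        x = 1 if rng.random() < mu[arm][0] else 0       # reward from true model theta_1
        l1 = mu[arm][0] if x == 1 else 1.0 - mu[arm][0]
        l2 = mu[arm][1] if x == 1 else 1.0 - mu[arm][1]
        post = post * l1 / (post * l1 + (1.0 - post) * l2)  # Bayes update
        regret += mu[0][0] - mu[arm][0]
    return regret


def average_regret(p, T=1000, runs=200, delta=0.05, seed=0):
    """Average cumulative regret over independent runs with prior mass p on theta_1."""
    rng = random.Random(seed)
    return sum(run_thompson(p, T, delta, rng) for _ in range(runs)) / runs


for p in (0.01, 0.1, 0.9, 0.99):
    print(p, average_regret(p))
```

Setting `T=10000` and `runs=2000` would match the horizon and number of runs reported above, at a correspondingly higher running time.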

Figure 1 shows the relation between $R_T(\theta_1, \mathrm{TS}(p_1))$ and $p$, for both
the good- and bad-prior cases, with $N \in \{2, 5\}$. The $y$-axis is the average
cumulative regret. The left panel has $\sqrt{1/p}$ as the $x$-axis, with $p \in
\{0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1\}$; the right panel has $\sqrt{1-p}$ as the
$x$-axis, with $p \in \{0.995, 0.998, 0.999, 0.9995, 0.9998, 0.9999\}$. As predicted by our
upper/lower bounds, both plots show a scaling that is nearly linear, especially
for the small-$p$ case. For the large-$p$ case, the linear effect is more prominent
when $p$ gets close to 1 (that is, towards the left end of the $x$-axis), as suggested by
the theory.

More interestingly, we can see a similar scaling for $N = 5$, although we have
not provided the corresponding upper bounds in this work. These empirical
results indicate that the lower bounds in our corollary may be tight, while the
upper bound derived directly from previous results (see Section 2.1) may
not.










Fig. 1. Empirical cumulative regret $R_T(\theta_1, \mathrm{TS}(p_1))$, averaged over 2000 runs, for the
poor-prior (left) and good-prior (right) cases. The $y$-axis is $R_T(\theta_1, \mathrm{TS}(p_1))$. The $x$-axis
is $\sqrt{1/p}$ for the left panel, and $\sqrt{1-p}$ for the right.



